Automatic Thesaurus Generation using Co-occurrence
Authors
Abstract
This paper proposes characterizing useful thesaurus terms by the informativity of co-occurrence with that term. Given a corpus of documents, informativity is formalized as the information gain of the weighted average term distribution of all documents containing that term. Although the resulting algorithm for thesaurus generation is unsupervised, we find that high-informativity terms correspond to large and coherent subsets of documents. We evaluate our method on a set of Dutch Wikipedia articles by comparing high-informativity terms with keywords for the Wikipedia category of the articles.
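The abstract does not spell out the formula, so the following is only a minimal sketch under stated assumptions: information gain is taken to be the Kullback-Leibler divergence of a term's context distribution from a corpus-wide background distribution, documents are weighted by how often they contain the term, and the background is a length-weighted average over all documents. The function names (`term_distribution`, `average_distribution`, `kl_divergence`, `informativity`) are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def term_distribution(tokens):
    """Relative term frequencies of one tokenized document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def average_distribution(docs, weights):
    """Weighted average of the documents' term distributions."""
    total_weight = sum(weights)
    avg = Counter()
    for doc, w in zip(docs, weights):
        for t, p in term_distribution(doc).items():
            avg[t] += w * p / total_weight
    return dict(avg)


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; requires q[t] > 0 wherever p[t] > 0."""
    return sum(pt * math.log2(pt / q[t]) for t, pt in p.items() if pt > 0)


def informativity(term, docs):
    """Assumed score: how far the term's context distribution lies from the background."""
    containing = [d for d in docs if term in d]
    if not containing:
        return 0.0
    # Assumption: weight each document by the number of occurrences of the term.
    weights = [d.count(term) for d in containing]
    context = average_distribution(containing, weights)
    # Assumption: the background is the length-weighted average over all documents.
    background = average_distribution(docs, [len(d) for d in docs])
    return kl_divergence(context, background)


docs = [
    ["the", "thesaurus", "term"],
    ["the", "thesaurus", "corpus"],
    ["the", "football", "goal"],
    ["the", "football", "match"],
]
scores = {t: informativity(t, docs) for t in ("the", "thesaurus", "football")}
```

In this toy corpus a term that occurs in every document, such as "the", has a context distribution equal to the background and therefore scores approximately zero, while "thesaurus" and "football" each pick out a distinct subset of documents and score higher.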
Similar resources
Using Hearst's Rules for the Automatic Acquisition of Hyponyms for Mining a Pharmaceutical Corpus
Fully Automatic Thesaurus Generation (ATG) seeks to generate useful thesauri by mining a corpus of raw text. A number of statistical approaches based on term co-occurrence exist for this, but in general they can only estimate the strength of the relationship between two terms, not its nature. In this paper we implement Hearst's method of discovering the hyponymy relations which are t...
Ad Hoc Retrieval Experiments Using WordNet and Automatically Constructed Thesauri
This paper describes our method for the automatic ad hoc task of TREC-7. We propose a method to improve the performance of an information retrieval system by expanding the query using 3 different types of thesaurus. The expansion terms are taken from a handcrafted thesaurus (WordNet), a co-occurrence-based automatically constructed thesaurus, and a syntactically predicate-argument based automatically constructe...
Alleviating Search Uncertainty Through Concept Associations: Automatic Indexing, Co-Occurrence Analysis, and Parallel Computing
In this article, we report research on an algorithmic approach to alleviating search uncertainty in a large information space. Grounded on object filtering, automatic indexing, and co-occurrence analysis, we performed a... [...] gather, process, and retrieve information. These systems provide a wide variety of information and services, ranging from daily updates of foreign and national news, movie revie...
English-Japanese Cross-lingual Query Expansion Using Random Indexing of Aligned Bilingual Text Data
Vector space models can be used for extracting semantically similar words from the co-occurrence statistics of words in large text data. In this paper, we report on our NTCIR 2002 experiments using the Random Indexing vector space method for extracting an English-Japanese cross-lingual thesaurus from aligned English-Japanese bilingual data. The cross-lingual thesaurus has been used for automatic...
Graph-based Word Clustering using a Web Search Engine
Word clustering is important for automatic thesaurus construction, text classification, and word sense disambiguation. Recently, several studies have reported using the web as a corpus. This paper proposes an unsupervised algorithm for word clustering based on a word similarity measure derived from web counts. Each pair of words is submitted as a query to a search engine, and the resulting counts form a co-occurrence matrix. By cal...